By: Ronald Liu, Rishabh Mathur, Jaanasheen Bala, Shaleen Gorakatar
Pulsars are highly magnetised rotating neutron stars that emit regular pulses of radio waves from their magnetic poles. As the star rotates, these beams sweep across the sky, producing a detectable, periodically repeating pattern of broadband radio emission that varies slightly with each rotation. Individual pulsars are unique in their emission patterns. Astronomers observe these radio waves through large radio telescopes and use them, among other things, to estimate the distances of cosmic objects. Possible signal detections, called candidates, are averaged over many rotations before being classified. Radio Frequency Interference (RFI) and noise account for the majority of detections, making it difficult to identify real pulsars. The predictive question we aim to answer through this project is: can we predict the class of the candidates in our pulsar star dataset using the integrated profile statistics?
The University of Manchester’s Pulsar Star dataset will be used. The pulsar candidates in this dataset were collected during the High Time Resolution Universe Survey. Each row represents a pulsar candidate, described by eight continuous variables (the mean, standard deviation, excess kurtosis and skewness of both the integrated pulse profile and the DM-SNR curve) and one class variable. The class label, in the last column, has a value of 1 for a real pulsar and 0 otherwise. The integrated pulse profile, the pulsar’s fingerprint, is the sum of many individual pulses. The DM-SNR (Dispersion Measure Signal-to-Noise Ratio) curve characterizes the background radio waves that may interfere with the desired signal.
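The four summary statistics computed for each curve can be sketched in base R. The profile vector below is a made-up toy, not drawn from the dataset, and the moment formulas follow one common convention (note that `sd()` divides by n − 1):

```r
# TOY EXAMPLE (NOT FROM THE DATASET): a flat baseline with one narrow bump,
# summarized by the four statistics used as features in HTRU2.
profile <- c(rep(1, 45), 1 + dnorm(seq(-3, 3, length.out = 10)), rep(1, 45))

m <- mean(profile)
s <- sd(profile)
skew <- mean((profile - m)^3) / s^3      # skewness
kurt <- mean((profile - m)^4) / s^4 - 3  # excess kurtosis (0 for a Gaussian)

c(mean = m, sd = s, skewness = skew, excess_kurtosis = kurt)
```

For a profile like this, which is mostly baseline with a narrow peak, both the skewness and the excess kurtosis come out positive.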
install.packages("tidyverse")
install.packages("tidymodels")
install.packages("themis")
install.packages("cowplot")
install.packages("repr")
install.packages("GGally")
install.packages("ISLR")
library(tidyverse)
library(tidymodels)
library(GGally)
library(repr)
library(cowplot)
library(ISLR)
library(themis)
set.seed(1)
options(repr.matrix.max.rows = 6)
The first step is to read the file from the web into Jupyter using the read_csv function. The raw dataset has eight numeric variables and one categorical variable, all of the double data type and without column names. We tidied the data by assigning the given names to the columns and changing the categorical variable from double to factor data type.
# WE DOWNLOAD THE DATASET FROM THE WEB, WHICH DOES NOT HAVE COLUMN NAMES
temp <- tempfile()
download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/00372/HTRU2.zip", temp)
pulsar_file <- unz(temp, "HTRU_2.csv")
pulsar <- read_csv(pulsar_file, col_names = FALSE)
# WE ASSIGN COLUMN NAMES ACCORDING TO THE INFORMATION THAT HAS BEEN GIVEN TO US
colnames(pulsar) <- c("mean_ip", "std_ip", "kurt_ip", "skew_ip", "mean_dm_snr", "std_dm_snr", "kurt_dm_snr", "skew_dm_snr", "class")
# WE CHANGE CLASS TO A FACTOR VARIABLE BECAUSE THAT IS THE CATEGORICAL VARIABLE THAT WE ARE TRYING TO PREDICT FOR
pulsar_mutate <- pulsar |>
mutate(class = as_factor(class))
print("Table 1: Tidy pulsar dataset.")
pulsar_mutate
Rows: 17898 Columns: 9
[1] "Table 1: Tidy pulsar dataset."
| mean_ip | std_ip | kurt_ip | skew_ip | mean_dm_snr | std_dm_snr | kurt_dm_snr | skew_dm_snr | class |
|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <fct> |
| 140.5625 | 55.68378 | -0.2345714 | -0.6996484 | 3.199833 | 19.11043 | 7.975532 | 74.24222 | 0 |
| 102.5078 | 58.88243 | 0.4653182 | -0.5150879 | 1.677258 | 14.86015 | 10.576487 | 127.39358 | 0 |
| 103.0156 | 39.34165 | 0.3233284 | 1.0511644 | 3.121237 | 21.74467 | 7.735822 | 63.17191 | 0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 119.3359 | 59.93594 | 0.1593631 | -0.74302540 | 21.430602 | 58.87200 | 2.499517 | 4.595173 | 0 |
| 114.5078 | 53.90240 | 0.2011614 | -0.02478884 | 1.946488 | 13.38173 | 10.007967 | 134.238910 | 0 |
| 57.0625 | 85.79734 | 1.4063910 | 0.08951971 | 188.306020 | 64.71256 | -1.597527 | 1.429475 | 0 |
# 1. WE CALCULATE THE AVERAGE OF EACH COLUMN SO WE CAN DETERMINE IF SCALING IS NECESSARY FOR OUR MODEL
pulsar_with_column_means <- pulsar |>
select(mean_ip:skew_dm_snr) |>
map_df(mean)
print("Table 2: Average of each variable from the pulsar dataset.")
pulsar_with_column_means
[1] "Table 2: Average of each variable from the pulsar dataset."
| mean_ip | std_ip | kurt_ip | skew_ip | mean_dm_snr | std_dm_snr | kurt_dm_snr | skew_dm_snr |
|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 111.08 | 46.54953 | 0.4778573 | 1.770279 | 12.6144 | 26.32651 | 8.303556 | 104.8577 |
# 2. WE SPLIT OUR DATA INTO A TRAINING SET AND A TESTING SET
pulsar_split <- initial_split(pulsar_mutate, prop = 0.75, strata = class)
pulsar_train_split <- training(pulsar_split)
pulsar_test_split <- testing(pulsar_split)
# 3. WE DETERMINE WHICH PREDICTOR VARIABLES APPEAR TO DIFFER THE MOST BETWEEN THE TWO CLASSES, USING GGPAIRS()
options(repr.plot.height = 15, repr.plot.width = 15)
correlation <- ggpairs(pulsar_train_split, mapping = aes(color = class, alpha = 0.5), lower = list(combo=wrap("facethist", bins = 30))) +
labs(caption = "Figure 1: ggpairs plot of training dataset.", title = "Correlation of All Variables") +
theme(text = element_text(size = 12),
plot.title = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 0.5, size = 12))
correlation
# 4. PULSAR OBSERVATION COUNTS FOR EACH CLASS
count_pulsar_train_split <- pulsar_train_split |>
group_by(class) |>
summarize(n = n())
print("Table 3: Class count of pulsar dataset.")
count_pulsar_train_split
# AS THE CLASS IS VERY IMBALANCED, WE WILL HAVE TO FIRST BALANCE OUR CLASSES
[1] "Table 3: Class count of pulsar dataset."
| class | n |
|---|---|
| <fct> | <int> |
| 0 | 12200 |
| 1 | 1223 |
Our chosen predictor variables, kurt_ip and mean_dm_snr, are used in the recipe to predict the factor variable class. The training set is downsampled on class within the recipe, with under_ratio set to 1 so that class 0 is cut down to the size of class 1, fixing the imbalance. Downsampling was chosen over upsampling because it yields a smaller training set (2446 data points for downsampling compared to 24400 for upsampling), allowing for a shorter run time. After the initial analysis, the need to scale and center our data was identified, so all predictors are scaled and centered in the recipe using the step_scale and step_center functions.
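The size comparison above follows directly from the class counts in Table 3; a quick base-R check of the arithmetic:

```r
# CLASS COUNTS FROM THE TRAINING SPLIT (TABLE 3)
n_class0 <- 12200
n_class1 <- 1223

# step_downsample WITH under_ratio = 1 KEEPS min(n) ROWS PER CLASS;
# UPSAMPLING WOULD INSTEAD REPLICATE THE MINORITY CLASS UP TO max(n) ROWS.
downsampled_total <- 2 * min(n_class0, n_class1)  # 2446
upsampled_total   <- 2 * max(n_class0, n_class1)  # 24400
c(downsampled = downsampled_total, upsampled = upsampled_total)
```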
# WE WILL ONLY ACCOUNT FOR KURT_IP AND MEAN_DM_SNR VARIABLES
# WE WILL HAVE TO USE STEP_DOWNSAMPLE IN OUR RECIPE TO BALANCE OUR CLASSES
pulsar_recipe <- recipe(class ~ kurt_ip + mean_dm_snr, data = pulsar_train_split) |>
step_downsample(class, under_ratio = 1, skip = TRUE) |>
step_scale(all_predictors()) |>
step_center(all_predictors())
downsampled_pulsar_train_split <- recipe(class ~ kurt_ip + mean_dm_snr, data = pulsar_train_split) |>
step_downsample(class, under_ratio = 1, skip = FALSE) |>
prep() |>
bake(pulsar_train_split)
print("Table 4: Downsampled pulsar dataset with kurt_ip, mean_dm_snr, and class variables.")
downsampled_pulsar_train_split
# SINCE PREPARATION AND BAKING ARE REQUIRED, THE DOWNSAMPLED PULSAR TRAIN SET WILL BE USED FOR PLOTTING;
# WE DO NOT WANT SCALED DATA EITHER TO ENSURE INTERPRETABILITY OF PLOT.
[1] "Table 4: Downsampled pulsar dataset with kurt_ip, mean_dm_snr, and class variables."
| kurt_ip | mean_dm_snr | class |
|---|---|---|
| <dbl> | <dbl> | <fct> |
| -0.08654116 | 2.471572 | 0 |
| 0.79954996 | 21.605351 | 0 |
| 0.34043056 | 3.663043 | 0 |
| ⋮ | ⋮ | ⋮ |
| 4.779124 | 129.8637 | 1 |
| 6.377273 | 141.8604 | 1 |
| 0.671724 | 149.8436 | 1 |
Two recipes have been created because the unscaled, downsampled data is more interpretable and therefore provides a better visualization. The prep() and bake() functions must be applied to the first recipe to produce downsampled_pulsar_train_split. The other recipe, pulsar_recipe, contains the scaling and centering steps and will be used for the rest of our classification process; it does not require prep() and bake() because the workflow does this for us.
# A SENSE OF THE CLASS DISTRIBUTION IN OUR DOWNSAMPLED TRAINING DATA CAN BE GAINED BY ANALYZING
# THE CORRELATION BETWEEN OUR TWO PREDICTOR VARIABLES: KURT IP AND MEAN DM SNR.
options(repr.plot.height = 15, repr.plot.width = 15)
correlation_plot <- ggpairs(downsampled_pulsar_train_split, mapping = aes(color = class, alpha = 0.5), lower = list(combo = wrap("facethist", bins = 30))) +
labs(caption = "Figure 2: ggpairs plot of Kurtosis IP and Mean DM-SNR variables from downsampled training dataset.", title = "Correlation of kurt_ip and mean_dm_snr") +
theme(text = element_text(size = 12),
plot.title = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 0.5, size = 12))
correlation_plot
The model specification for our K-nearest neighbors classification is built using the nearest_neighbor() function, with the argument neighbors = tune(). 5-fold cross-validation, stratified by class, is used to determine our optimum k; 5 folds were chosen to reduce run time. This is used later in the workflow to determine the optimum k by plotting accuracy against k.
# THE MODEL TO DETERMINE K
pulsar_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
set_engine("kknn") |>
set_mode("classification")
# USING CROSS-VALIDATION TO HELP US DETERMINE K
# WE CHOOSE V = 5 INSTEAD OF V = 10 TO REDUCE RUN TIME
pulsar_vfold <- vfold_cv(pulsar_train_split, v = 5, strata = class)
A tibble named k_vals containing k values from 1 to 10 is created so the model can be tuned over that range. Using the scaled, centered and downsampled pulsar_recipe, a workflow has been created for the classification. The results of the workflow are filtered for accuracy in the .metric column, which allows us to plot accuracy against k to determine the optimum value. k = 8 has been selected as our number of neighbors after analysing the plot, because it provides a relatively high accuracy of ~96% and changing k to a nearby value has little effect on the accuracy.
# WE ARE LOOKING AT K VALUES WHICH ARE BETWEEN 1-10 FOR OUR MODEL
k_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))
# COMBINING EVERYTHING TO PUT OUR WORKFLOW TOGETHER
pulsar_results <- workflow() |>
add_recipe(pulsar_recipe) |> # recipe
add_model(pulsar_spec) |> # model spec with tune()
tune_grid(resamples = pulsar_vfold, grid = k_vals) |>
collect_metrics()
print("Table 5: Metric results for each K value.")
pulsar_results
# GETTING THE ACCURACIES TO PLOT AN ACCURACY VS K PLOT TO DETERMINE OUR IDEAL K
pulsar_accuracies <- pulsar_results |>
filter(.metric == "accuracy")
print("Table 6: Accuracy results for each K value.")
pulsar_accuracies
# PLOTTING ACCURACY VS K
accuracy_vs_k <- ggplot(pulsar_accuracies, aes(x = neighbors, y = mean)) +
geom_point() +
geom_line() +
labs(x = "Neighbors", y = "Accuracy Estimate", caption = "Figure 3: Accuracy estimate for each K value.", title = "Accuracy vs K") +
theme(text = element_text(size = 12),
plot.title = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 0.5, size = 12))
accuracy_vs_k
[1] "Table 5: Metric results for each K value."
| neighbors | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <dbl> | <int> | <dbl> | <chr> |
| 1 | accuracy | binary | 0.9010665 | 5 | 0.005869089 | Preprocessor1_Model01 |
| 1 | roc_auc | binary | 0.8992382 | 5 | 0.004313812 | Preprocessor1_Model01 |
| 2 | accuracy | binary | 0.9010665 | 5 | 0.005869089 | Preprocessor1_Model02 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 9 | roc_auc | binary | 0.9632487 | 5 | 0.004810662 | Preprocessor1_Model09 |
| 10 | accuracy | binary | 0.9600696 | 5 | 0.003824219 | Preprocessor1_Model10 |
| 10 | roc_auc | binary | 0.9632590 | 5 | 0.004492992 | Preprocessor1_Model10 |
[1] "Table 6: Accuracy results for each K value."
| neighbors | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <dbl> | <int> | <dbl> | <chr> |
| 1 | accuracy | binary | 0.9010665 | 5 | 0.005869089 | Preprocessor1_Model01 |
| 2 | accuracy | binary | 0.9010665 | 5 | 0.005869089 | Preprocessor1_Model02 |
| 3 | accuracy | binary | 0.9415932 | 5 | 0.003834571 | Preprocessor1_Model03 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 8 | accuracy | binary | 0.9593990 | 5 | 0.003759576 | Preprocessor1_Model08 |
| 9 | accuracy | binary | 0.9600696 | 5 | 0.003824219 | Preprocessor1_Model09 |
| 10 | accuracy | binary | 0.9600696 | 5 | 0.003824219 | Preprocessor1_Model10 |
# PULLING ACCURACY FOR K = 8 (CHOSEN FROM THE PLOT ABOVE)
accuracy_for_8 <- pulsar_accuracies |>
filter(neighbors == 8)
print("Table 7: Accuracy results for K = 8.")
accuracy_for_8
[1] "Table 7: Accuracy results for K = 8."
| neighbors | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <dbl> | <int> | <dbl> | <chr> |
| 8 | accuracy | binary | 0.959399 | 5 | 0.003759576 | Preprocessor1_Model08 |
# 1. SPEC WITH OUR CHOSEN K VALUE
final_pulsar_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 8) |>
set_engine("kknn") |>
set_mode("classification")
# 2. PUTTING OUR WORKFLOW TOGETHER WITH THE RECIPE
pulsar_final_fit <- workflow() |>
add_recipe(pulsar_recipe) |>
add_model(final_pulsar_spec) |>
fit(data = pulsar_train_split)
# 3. PREDICTING WITH PULSAR TEST SET
prediction_pulsar <- predict(pulsar_final_fit, pulsar_test_split) |>
bind_cols(pulsar_test_split) |>
mutate(Correct = (class == .pred_class))
print("Table 8: Testing dataset with prediction_pulsar results.")
prediction_pulsar
# 4. EVALUATION OF OUR MODEL'S ACCURACY WHEN IT PREDICTS ON OUR TEST SET
test_set_accuracy <- prediction_pulsar |>
metrics(truth = class, estimate = .pred_class) |>
filter(.metric == "accuracy") |>
select(.estimate) |>
pull()
print("Model accuracy when predicting on testing dataset.")
test_set_accuracy
# 5. CREATE A CONFUSION MATRIX OF OUR MODEL
conf_mat_test <- prediction_pulsar |>
conf_mat(truth = class, estimate = .pred_class)
print("Model confusion matrix when predicting on testing dataset.")
conf_mat_test
[1] "Table 8: Testing dataset with prediction_pulsar results."
| .pred_class | mean_ip | std_ip | kurt_ip | skew_ip | mean_dm_snr | std_dm_snr | kurt_dm_snr | skew_dm_snr | class | Correct |
|---|---|---|---|---|---|---|---|---|---|---|
| <fct> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <fct> | <lgl> |
| 0 | 140.56250 | 55.68378 | -0.2345714 | -0.6996484 | 3.199833 | 19.11043 | 7.975532 | 74.24222 | 0 | TRUE |
| 0 | 88.72656 | 40.67223 | 0.6008661 | 1.1234917 | 1.178930 | 11.46872 | 14.269573 | 252.56731 | 0 | TRUE |
| 0 | 142.07812 | 45.28807 | -0.3203284 | 0.2839525 | 5.376254 | 29.00990 | 6.076266 | 37.83139 | 0 | TRUE |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 0 | 90.14844 | 40.78604 | 0.45713856 | 0.88551621 | 5.3545150 | 29.759511 | 6.102231 | 37.97952 | 0 | TRUE |
| 0 | 121.37500 | 53.24516 | 0.10377221 | -0.36511875 | 1.0953177 | 12.239976 | 16.258042 | 303.88002 | 0 | TRUE |
| 0 | 118.48438 | 50.60848 | -0.02905883 | -0.02749376 | 0.4222408 | 8.086684 | 27.446113 | 830.63855 | 0 | TRUE |
[1] "Model accuracy when predicting on testing dataset."
[1] "Model confusion matrix when predicting on testing dataset."
Truth
Prediction 0 1
0 3931 37
1 128 379
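From the confusion matrix above, the overall accuracy and per-class recall can be recomputed by hand. A minimal sketch using the printed counts (rows of the matrix are predictions, columns are truth):

```r
# COUNTS TAKEN FROM THE CONFUSION MATRIX PRINTED ABOVE
tn <- 3931  # predicted 0, truth 0 (true negatives)
fn <- 37    # predicted 0, truth 1 (false negatives)
fp <- 128   # predicted 1, truth 0 (false positives)
tp <- 379   # predicted 1, truth 1 (true positives)

accuracy <- (tn + tp) / (tn + fn + fp + tp)
recall_0 <- tn / (tn + fp)  # proportion of non-pulsars correctly identified
recall_1 <- tp / (tp + fn)  # proportion of pulsars correctly identified
round(c(accuracy = accuracy, recall_0 = recall_0, recall_1 = recall_1), 4)
```

Both per-class recall values are well above 75%, consistent with the proportions shown in Figure 4.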
A bar chart is chosen to visualize the proportions of correct and incorrect predictions for both classes, giving a visual estimate of the model's accuracy for each class.
# LOOKING AT THE PROPORTION OF OUR PREDICTIONS ON TEST DATA THAT ARE CORRECT
pulsar_prediction_data <- prediction_pulsar |>
mutate(Correct = as_factor(Correct)) |>
group_by(Correct, class) |>
summarize(count = n(), prop = count/ nrow(prediction_pulsar))
pulsar_proportions_plot <- ggplot(pulsar_prediction_data, aes(x=class, y=count, fill = as_factor(Correct)))+
geom_bar(position = "fill", stat = "identity") +
labs(fill = "Correct Predictions",
caption = "Figure 4: Correct predicted results proportion for testing dataset.",
x = "Pulsar Star \n (0 = Non-pulsar), (1 = Pulsar)",
y = "Proportion of Correct Predictions",
title = "Prediction Proportions for Test Data") +
theme(text = element_text(size = 20),
plot.title = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 0.5, size = 12))
pulsar_proportions_plot
The colour-shaded visualization illustrates how the classifier uses the values of the two predictor variables, kurt_ip and mean_dm_snr, to predict the class of unknown data points. An unknown data point falling within the orange region is predicted to be a non-pulsar signal, and one falling within the blue region is predicted to be a true pulsar. The pulsar class predictions (the blue region) are spread over a much larger range of kurt_ip than the non-pulsar class predictions (the orange region).
# FINAL VISUALIZATION OF THE ANALYSIS, LOOKING AT HOW OUR CLASSIFIER WORKS WITH PREDICTOR VARIABLES KURT_IP AND MEAN_DM_SNR
kurt_ip_grid <- seq(min(pulsar_train_split$kurt_ip),
max(pulsar_train_split$kurt_ip),
length.out = 100)
mean_dm_snr_grid <- seq(min(pulsar_train_split$mean_dm_snr),
max(pulsar_train_split$mean_dm_snr),
length.out = 100)
asgrid <- as_tibble(expand.grid(kurt_ip = kurt_ip_grid,
mean_dm_snr = mean_dm_snr_grid))
pulsar_prediction <- predict(pulsar_final_fit, asgrid)
pulsar_prediction_table <- bind_cols(pulsar_prediction, asgrid) |>
rename(class = .pred_class)
pulsar_class_decision_plot <- ggplot() +
geom_point(data = pulsar_train_split,
mapping = aes(x = kurt_ip, y = mean_dm_snr, color = class),
alpha = 0.75) +
geom_point(data = pulsar_prediction_table,
mapping = aes(x = kurt_ip, y = mean_dm_snr, color = class),
alpha = 0.1,
size = 5) +
labs(color = "Class",
x = "Kurtosis IP",
y = "Mean DM-SNR",
caption = "Figure 5: Classifier decisions (background colour) compared to training dataset (points).",
title = "Classifier Decision based on Mean DM-SNR vs. Kurtosis IP") +
scale_color_hue(labels = c("Non-pulsar", "Pulsar")) +
theme(text = element_text(size = 12),
plot.title = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 0.5, size = 12))
pulsar_class_decision_plot
Data analysis showed that the kurtosis of the integrated pulse profile and the mean of the DM-SNR curve of a candidate's detected signal are reliable for classifying whether that signal comes from a real pulsar star. Our classification model, built on these two variables, has an estimated accuracy of 96.42% on our test dataset. Compared to the 50% accuracy of the majority classifier (the training classes were downsampled to equal proportions), the estimated accuracy of the model is significantly higher, showing that one can predict whether a signal is a pulsar quite accurately. Figure 4 shows the per-class performance: classification accuracy is well above 75% for both class 0 (non-pulsar) and class 1 (pulsar) signals.
Looking at the boxplots of the DM-SNR curves in Fig. 2, the majority of non-pulsar signals (class 0) have a mean near zero, whereas pulsar signals (class 1) have a more spread-out, non-zero distribution of means. This finding is predictable and consistent with previous literature on DM-SNR curves for pulsars, which states that “a legitimate pulsar signal…will be dispersed…[and] should peak at a DM greater than zero” (Lyon, 2016). Moreover, previous literature has described pulsar excess kurtosis distributions that are 'less pointed' and 'broader' than those of non-pulsars, which have “distinguishable peaks” (Amarouchous, 2018). The excess kurtosis distribution of the integrated profile in Fig. 2 shows exactly this: non-pulsars have a much narrower distribution than pulsars, whose distribution is relatively flat and broad, corroborating Amarouchous's findings.
It is not surprising that the classifier produces such high accuracy when using the excess kurtosis of the integrated profile as a predictor variable. Excess kurtosis is a statistic with “good antinoise ability” (Zhao et al., 2022), meaning it can effectively and precisely filter out spurious signals produced by the background noise that pulsar detectors pick up, helping to determine whether a signal comes from a pulsar star. This makes it a particularly good predictor variable for a pulsar classification model.
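As an illustration of this point (a toy sketch with simulated data, not drawn from the dataset): pure Gaussian noise has excess kurtosis near zero, while a signal that is mostly flat baseline with one narrow pulse has strongly positive excess kurtosis, so the statistic separates the two regimes cleanly.

```r
set.seed(1)
# SAMPLE EXCESS KURTOSIS (sd() DIVIDES BY n - 1, A MINOR CONVENTION CHOICE)
excess_kurtosis <- function(x) mean((x - mean(x))^4) / sd(x)^4 - 3

noise <- rnorm(10000)                     # pure background noise
pulse <- rnorm(10000, sd = 0.1)           # quiet baseline...
pulse[5000:5050] <- pulse[5000:5050] + 5  # ...with one narrow pulse

excess_kurtosis(noise)  # near 0
excess_kurtosis(pulse)  # large and positive
```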
Astronomers would likely find the high accuracy of this classification model, along with the identification of the mean of the DM-SNR curve and the excess kurtosis of the integrated pulse profile as reliable predictor variables, to be of great use. As stated in the introduction, pulsars are crucial for investigating the most extreme states of matter, searching for planets outside of our solar system, determining distances in space, and perhaps even studying black holes. An accurate classification model that can swiftly filter for pulsars can increase the effectiveness of astronomers and scientists working with pulsar signals, opening up additional opportunities for discoveries about the solar system and matter.
Future inquiries can focus on why pulsar stars have particular statistical traits, such as why the distribution for excess kurtosis is wider for pulsar stars. In order to classify pulsar stars more accurately in the future, more researchers may be interested in researching each of the statistical traits in greater detail and attempting to establish a threshold value or range for each continuous variable observed in the HTRU2 dataset.
Overall, our approach offers a high estimated accuracy when utilised to identify detected signals as non-pulsar or pulsar stars. This experiment supports the notion that a signal's identity (pulsar or non-pulsar) may be inferred from the mean of its DM-SNR curve and the excess kurtosis of its integrated pulse profile. This study shows that it is possible to identify pulsars using these statistical properties, which could enable more discoveries and more fruitful future astronomical and natural science research.
Needles in a haystack: finding pulsars in cosmological data. (2018, June 10). Retrieved December 1, 2022, from https://dataincognita.net/2018/04/23/needles-in-a-haystack-finding-pulsars-in-cosmological-data/
What are pulsars? (2022, May 24). Retrieved November 29, 2022, from https://www.livescience.com/what-are-pulsars
An Introduction to Pulsars. (n.d.). Retrieved December 2, 2022, from https://www.atnf.csiro.au/outreach/education/everyone/pulsars/index.html
Astronomy & Astrophysics. (2011, May 13). Retrieved December 1, 2022, from https://www.aanda.org/articles/aa/full/2006/40/aa6029-06/aa6029-06.html
Measures of Skewness and Kurtosis. (n.d.). Retrieved December 2, 2022, from https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm